feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel#2792

Merged
bkryu merged 3 commits into flashinfer-ai:main from elvischenv:elvischenv/support-rope-fusion-token-padding
Apr 9, 2026

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Mar 16, 2026

📌 Description

vLLM uses seqlen=0 padding tokens to run a full CUDA graph: https://github.com/vllm-project/vllm/blob/95c0f928cdeeaa21c4906e73cee6a156e1b3b995/vllm/v1/worker/gpu/model_runner.py#L652-L654

This PR updates the following functions:
  • get_batch_indices_positions_kernel: initialize batch_indices/positions to -1/0 so that padding tokens can be recognized
  • rope_quantize_fp8_append_paged_kv_cache: skip those padding tokens
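As a mental model, the marking scheme above can be sketched in plain Python. This is an illustrative stand-in for the Triton kernel, not FlashInfer's actual implementation; the helper name and argument order mirror get_batch_indices_positions but are simplified:

```python
def batch_indices_positions(append_indptr, seq_lens, nnz):
    # Every slot starts as a padding token: batch_indices = -1, positions = 0.
    batch_indices = [-1] * nnz
    positions = [0] * nnz
    # Overwrite only the slots that belong to a real request.
    for req, (start, end) in enumerate(zip(append_indptr, append_indptr[1:])):
        for i in range(start, end):
            batch_indices[i] = req
            # Position of this appended token within the request's full sequence.
            positions[i] = seq_lens[req] - (end - i)
    return batch_indices, positions

# Decode batch: 2 real requests (1 new token each) padded out to nnz=4 tokens,
# as a CUDA-graph-captured batch would be.
bi, pos = batch_indices_positions([0, 1, 2], [5, 9], nnz=4)
# bi  == [0, 1, -1, -1]  -> slots 2 and 3 are padding and will be skipped
# pos == [4, 8,  0,  0]
```

The fused kernel can then branch on batch_indices[i] >= 0 and leave the KV cache untouched for the padded tail.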

Testing:
pytest -v -s tests/attention/test_rope.py::test_rope_quantize_fp8_append_paged_kv_cache_padding

======= 4 passed in 1.66s =======

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or using my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Paged KV cache append now reliably skips padding entries, preventing unintended overwrites during attention quantization and positional encoding processing.
    • Deterministic initialization of internal index/position buffers to avoid uninitialized memory affecting padding behavior.
  • Tests

    • Added parameterized regression test confirming padding entries are ignored and prefilled KV cache data remains exactly unchanged.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the KV cache update mechanism to gracefully handle padding tokens, which is crucial for enabling full CUDA graph functionality in systems like vLLM. By explicitly marking and skipping padding tokens during KV cache writes, the changes prevent data corruption and ensure the integrity of the cache while maintaining the performance benefits of fixed-size batches.

Highlights

  • Padding Token Support: Introduced support for padding tokens within the rope+quant+kv cache update fusion kernel to align with vLLM's full CUDA graph requirements.
  • Kernel Updates: The get_batch_indices_positions_kernel now explicitly marks padding tokens with batch_indices = -1 and positions = 0. The rope_quantize_fp8_append_paged_kv_cache kernel has been updated to skip processing these marked padding tokens.
  • Test Coverage: A new test case, test_rope_quantize_fp8_append_paged_kv_cache_padding, was added to ensure that padding tokens do not corrupt the KV cache, simulating a decode batch with padded requests.


Changelog
  • flashinfer/page.py
    • Updated get_batch_indices_positions to pass the nnz argument to the Triton kernel.
  • flashinfer/triton/page.py
    • Modified get_batch_indices_positions_kernel to accept nnz and to fill padding entries with batch_indices=-1 and positions=0.
  • include/flashinfer/pos_enc.cuh
    • Added a conditional check in RopeQuantizeAppendPagedKVCacheKernel to return early if batch_indices is less than 0, effectively skipping padding tokens.
  • tests/attention/test_rope.py
    • Added test_rope_quantize_fp8_append_paged_kv_cache_padding to validate that padding tokens do not corrupt the KV cache.
Activity
  • The author has indicated that pre-commit checks have been installed and run, and tests have been added or updated as needed, with all tests passing.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Walkthrough

Deterministically initialize batch_indices and positions buffers and add a kernel-level guard so tokens with batch_indices = -1 are skipped, preventing RoPE, quantization, and paged KV cache append work for padding tokens.

Changes

  • Tensor Initialization (flashinfer/page.py): get_batch_indices_positions now initializes batch_indices to -1 (instead of leaving it uninitialized) when allocating, fills a provided batch_indices with -1, and initializes positions to zeros when not provided.
  • Kernel-Level Padding Guards (include/flashinfer/pos_enc.cuh): the RopeQuantizeAppendPagedKVCacheKernel token-processing block is wrapped in if (batch_indices[idx] >= 0), so page/entry computation, RoPE cos/sin loads, quantization, and the paged KV cache append paths are skipped for padding indices.
  • Test Coverage (tests/attention/test_rope.py): added test_rope_quantize_fp8_append_paged_kv_cache_padding, which constructs paged-KV metadata with padded requests, asserts the batch_indices padding markers, invokes the kernel, and verifies padded cache entries remain byte-identical to prefilled snapshots across attention types and layouts.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host (CPU)
    participant Kernel as RopeQuantizeAppendPagedKVCacheKernel (GPU)
    participant KV as Paged KV Cache
    Host->>Host: prepare inputs (batch_indices, positions)
    Host->>Kernel: launch kernel with inputs
    Kernel->>Kernel: compute global idx
    alt batch_indices[idx] >= 0
        Kernel->>Kernel: compute page/entry\napply RoPE, quantize
        Kernel->>KV: append/store K/V/Q into paged cache
    else batch_indices[idx] < 0
        Kernel->>Kernel: skip RoPE/quantize/cache ops
    end
    Kernel-->>Host: kernel completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • yzh119
  • kahyunnam
  • bkryu
  • nvmbreughe
  • jiahanc

Poem

🐰
I mark the padded hops with -1 bright,
So kernels skip the places out of sight,
RoPE stays neat, the cache keeps its lore,
No stray bytes tumble — calm on the floor,
A tiny hop for correctness tonight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅: the title accurately summarizes the main change: adding support for padding tokens with seqlen=0 in the rope+quant+kv cache update fusion kernel.
  • Description check ✅: the PR description covers the motivation (vLLM usage), the code changes (two updated functions), test results (4 passed), and completed checklist items. All critical sections from the template are addressed.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for padding tokens in the rope+quant+kv cache update fused kernel, which is useful for cudagraphs. The approach involves modifying get_batch_indices_positions_kernel to mark padding tokens and updating RopeQuantizeAppendPagedKVCacheKernel to skip them. A new test case is added to validate this padding logic. While the implementation changes seem correct, I've identified issues in the new test case where token positions are calculated incorrectly. This could cause the test to pass while not properly verifying the intended behavior, potentially masking bugs. I've provided suggestions to correct the test logic.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/attention/test_rope.py (1)

1390-1590: Add enable_pdl coverage to this new padding regression test.

Lines 1392-1589 only exercise the default path. Please parameterize enable_pdl and pass it into the fused call so padding behavior is validated under the programmatic dependent launch mode too.

Proposed test update
 @pytest.mark.parametrize("kv_layout", ["NHD", "HND"])
 @pytest.mark.parametrize("page_size", [16])
+@pytest.mark.parametrize("enable_pdl", [True, False])
 def test_rope_quantize_fp8_append_paged_kv_cache_padding(
@@
     kv_layout,
     page_size,
+    enable_pdl,
 ):
@@
     flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache(
@@
         quant_scale_kv=1.0,
         is_neox=False,
+        enable_pdl=enable_pdl,
     )
Inline comments:
In include/flashinfer/pos_enc.cuh:
  • Around lines 862-865: replace the early return on the batch_indices check so that all threads reach the PDL epilogue. Remove the `if (batch_indices[idx] < 0) return;` and instead wrap the work body that follows (the block currently between lines 867-1030) in `if (batch_indices[idx] >= 0) { ... }`. Keep the final epilogue (including the griddepcontrol.launch_dependents instruction) outside that guard so it executes unconditionally for every thread in the block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6761f8e1-64cf-4451-8fe3-a2914f272b95

📥 Commits

Reviewing files that changed from the base of the PR and between b418bc3 and 476de9c.

📒 Files selected for processing (4)
  • flashinfer/page.py
  • flashinfer/triton/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py

Comment on lines +862 to +863
// skip padding tokens with batch_indices < 0
if (batch_indices[idx] >= 0) {
Contributor Author


The main change is just this line; everything below it is only an indentation change.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-868: Make padding-sentinel check type-safe across PagedKVIdType instantiations.

Line 863 uses batch_indices[idx] >= 0, which is only safe when PagedKVIdType is signed. If it is ever unsigned, the padding sentinel -1 becomes the maximum value and this branch incorrectly passes, leading to invalid indptr indexing.

🔧 Proposed fix
-    // skip padding tokens with batch_indices < 0
-    if (batch_indices[idx] >= 0) {
+    constexpr PagedKVIdType kPaddingSentinel = static_cast<PagedKVIdType>(-1);
+    const PagedKVIdType batch_idx = batch_indices[idx];
+    if (batch_idx != kPaddingSentinel) {
       // Compute page location for this token
       uint32_t page_iter, entry_idx;
       paged_kv_like.page_size.divmod(
-          paged_kv_like.indptr[batch_indices[idx]] * paged_kv_like.page_size + positions[idx],
+          paged_kv_like.indptr[batch_idx] * paged_kv_like.page_size + positions[idx],
           page_iter, entry_idx);
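The signed/unsigned hazard described here is easy to reproduce outside CUDA; the Python sketch below emulates a 32-bit unsigned reinterpretation of the -1 sentinel (illustrative only, not FlashInfer code):

```python
SENTINEL = -1

def as_uint32(x):
    # Reinterpret the low 32 bits as an unsigned value, like casting to uint32_t.
    return x & 0xFFFFFFFF

# Signed index type: `>= 0` correctly rejects the padding sentinel.
assert not (SENTINEL >= 0)

# Unsigned index type: -1 wraps around to 4294967295, so a `>= 0` guard
# wrongly accepts the padding token and would index indptr out of bounds.
assert as_uint32(SENTINEL) >= 0
assert as_uint32(SENTINEL) == 4294967295

# An equality test against the cast sentinel works for either signedness,
# which is why the suggestion prefers `!= static_cast<PagedKVIdType>(-1)`.
assert as_uint32(SENTINEL) == as_uint32(-1)
```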

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a25b44d8-52fc-4514-8eeb-44210f1c0877

📥 Commits

Reviewing files that changed from the base of the PR and between 476de9c and 54c5fff.

📒 Files selected for processing (1)
  • include/flashinfer/pos_enc.cuh

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch 2 times, most recently from f0f6c80 to 63197ac, on March 16, 2026 at 07:29
@elvischenv elvischenv changed the title from "feat: Support padding token for rope+quant+kv cache update fusion kernel" to "feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel" on Mar 16, 2026
@elvischenv elvischenv changed the title from "feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel" to "feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel" on Mar 16, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-863: Consider adding a compile-time assertion to document the signed type requirement for the padding check.

The code assumes PagedKVIdType is signed to make the check batch_indices[idx] >= 0 work correctly for detecting padding tokens (marked with -1). While the type is hardcoded to int32_t in the JIT binding at csrc/rope.cu (lines 574, 605) and in Python via torch.int(), a static assertion would document this requirement and guard against accidental misuse:

static_assert(std::is_signed<PagedKVIdType>::value,
              "PagedKVIdType must be signed for padding token detection (batch_indices < 0)");

This defensive check aligns with the kernel's assumptions and improves long-term maintainability without adding runtime overhead.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de20e129-16a7-4945-80ef-553ab0f8df70

📥 Commits

Reviewing files that changed from the base of the PR and between 54c5fff and 63197ac.

📒 Files selected for processing (3)
  • flashinfer/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/page.py
  • tests/attention/test_rope.py

@elvischenv
Contributor Author

Hi @yzh119, could you help review this? We need this fix for integrating this kernel into vLLM. Thanks!

@elvischenv
Contributor Author

cc @kahyunnam for visibility.

@bkryu bkryu added the run-ci label Mar 20, 2026
@bkryu
Collaborator

bkryu commented Mar 20, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46584451 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #46584451: 14/20 passed

@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46776615 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46776615: 12/20 passed

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch from 63197ac to 832ac30 on March 24, 2026 at 12:28
@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been updated with latest changes, and the CI pipeline #47263242 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47263242: 13/20 passed

@nvpohanh
Contributor

@elvischenv could you rebase again?

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch from 09fad56 to c182b4c on March 31, 2026 at 01:49
@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been updated with latest changes, and the CI pipeline #47308264 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47308264: 7/20 passed

@nvpohanh
Contributor

@bkryu Could you review this or assign someone to review this? Thanks!

@nvpohanh
Contributor

nvpohanh commented Apr 7, 2026

@bkryu could you review this? Thanks

@bkryu
Collaborator

bkryu commented Apr 7, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been updated with latest changes, and the CI pipeline #47955669 is currently running. I'll report back once the pipeline job completes.

Collaborator


Hi @elvischenv, the changes generally look correct and the CI does seem to pass.

However, have you tried measuring the performance implications? I'm asking because torch.full and torch.zeros tend to call memset, which has higher overhead than torch.empty. I'm wondering whether there will be a noticeable performance difference from it.

Contributor


@elvischenv is this called per decoding step or per attention layer? If it is per decoding step, I am less worried about the additional memsets. But if it is per layer, the overhead may be a noticeable performance cost.

Contributor Author


get_batch_indices_positions is a helper function that prepares the arguments needed by rope_quantize_fp8_append_paged_kv_cache, so it should only be called once per decoding step. The whole iteration can then reuse the same batch_indices and positions, which won't produce noticeable overhead.
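The amortization argument can be sketched as follows; the two functions are hypothetical stand-ins with simplified signatures, while the real ones live in flashinfer and take tensors and cache arguments:

```python
def get_batch_indices_positions(nnz):
    # One sentinel-fill per decoding step: this is the memset-like cost
    # discussed above.
    return [-1] * nnz, [0] * nnz

def rope_quantize_fp8_append_paged_kv_cache(layer, batch_indices, positions):
    # Stand-in for the fused per-layer RoPE + FP8-quant + KV-append kernel.
    return layer

num_layers, nnz = 32, 8
# Prepared once per decoding step...
batch_indices, positions = get_batch_indices_positions(nnz)
calls = 0
for layer in range(num_layers):
    # ...and reused by every layer's fused kernel call in that step.
    rope_quantize_fp8_append_paged_kv_cache(layer, batch_indices, positions)
    calls += 1
# One initialization is amortized over `num_layers` kernel launches.
```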

Contributor


@bkryu Once per decoding step should be okay? Do you agree?

Collaborator


I agree that it should be fine.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47955669: 11/20 passed

@nvpohanh
Contributor

nvpohanh commented Apr 9, 2026

@bkryu Are the failures known ones or caused by this PR?

Copy link
Copy Markdown
Collaborator

@bkryu bkryu left a comment


CI failures are unrelated. LGTM!


@bkryu bkryu merged commit b705b67 into flashinfer-ai:main Apr 9, 2026
40 of 60 checks passed
@elvischenv elvischenv deleted the elvischenv/support-rope-fusion-token-padding branch April 12, 2026 17:36